Abstract

Background: Our systematic review summarizes the evidence concerning the accuracy of serum diagnostic and prognostic
tests for colorectal cancer (CRC). 

Methods: The databases MEDLINE and EMBASE were searched iteratively to identify the relevant literature for serum 
markers of CRC published from 1950 to August 2012. The articles that provided adequate information to meet the 
requirements of the meta-analysis of diagnostic and prognostic markers were included. A 2-by-2 table of each diagnostic 
marker and its hazard ratio (HR) and the confidence interval (CI) of each prognostic marker was directly or indirectly 
extracted from the included papers, and the pooled sensitivity and specificity of the diagnostic marker and the pooled HR 
and the CI of the prognostic marker were subsequently calculated using the extracted data. 

Results: In total, 104 papers related to the diagnostic markers and 49 papers related to the prognostic serum markers of 
CRC were collected, and only 19 of 92 diagnostic markers were investigated in more than two studies, whereas 21 out of 44 
prognostic markers were included in two or more studies. All of the pooled sensitivities of the diagnostic markers with ≥3
repetitions were less than 50%. Meta-analyses were performed for the prognostic markers with more than 3 studies;
VEGF had the highest pooled HR (2.245, CI: 1.347-3.744) and MMP-7 the lowest (1.099, CI: 1.018-1.187).

Conclusions: The quality of studies addressing the diagnostic and prognostic accuracy of the tests was poor, and the results
were highly heterogeneous. The poor characteristics indicate that these tests are of little value for clinical practice. 
Introduction

Colorectal cancer (CRC) is one of the most common 
malignancies in developed countries [1]. The incidence of CRC 
in China was lower than that in the West but has increased in 
recent years [2] and has become a substantial cancer burden in 
China. The CRC mortality rate in China is 7.35/100,000 people, 
according to a retrospective survey on deaths caused by malignant 
tumors in China from 2004 to 2005 [3]. Each year in the United 
Kingdom and the United States, there are approximately 32,000 
and 160,000 new cases diagnosed, respectively, and approximately 
500,000 new cases diagnosed worldwide [4]. Despite advances in 
dosing and scheduling of chemotherapy in both adjuvant and 
advanced settings, early detection of CRC is still strongly
emphasized [5].

The FOBT (fecal occult blood test) and colonoscopy are the
traditional methods for CRC screening. Although the FOBT is
non-invasive and cheap, its low sensitivity makes it
unacceptable for promotion and popularization [6]. Although
colonoscopy plus biopsy is the gold standard for colorectal cancer
screening and diagnosis, more than half of patients decline it
because of its invasive nature and the intestinal discomfort it
causes [7]. Compared with these screening methods, tests of
serum biomarkers are more convenient and less invasive and can
be more acceptable as part of a routine physical examination [8],
but most serum CRC markers still perform poorly for most patients
[9]. Although a number of serum markers of outcome in CRC
have been reported [10], there has been no clear consensus as to
their role, with many studies reporting conflicting results [11-13].

An important consideration is that a systematic review can 
highlight the underlying problems across individual studies and 
help identify the need for future research [14]. In the current 
paper, both of these aspects are addressed, and we hope that our 
findings will improve studies on CRC markers in the future. 

Materials and Methods 

Search strategy 

The systematic search addressed articles with information on 
markers in serum to include or exclude the presence of CRC 
published from January 1950 to August 2012. To fulfill our 
selection criteria, the studies had to have been published as a full
paper in English. Articles were identified by an electronic Medline
and PUBMED search using the following keywords: 'Colorectal',
'Colon', 'rectal', 'cancer', 'serum' and 'marker' (see Appendix 1 in
Materials S1 for the key words and corresponding "associated
words"; see Appendix 2 for the details of the search strategy). In the
current study, duplicates from Medline and EMBASE were
deleted automatically and manually with Reference Manager
Version 11 (Thomson Reuters, New York, NY, USA).

Inclusion and exclusion criteria 

For diagnostic marker(s), the meta-analysis focuses on the 
sensitivity and specificity of a marker, and the most basic 
requirement is a 2x2 table of outcome by marker index test to 
calculate the two values. A brief overview of the criteria for a 
diagnostic marker is the following: 

1. The original article is in English and about diagnostic
serum marker(s) of only primary colorectal cancer
(CRC, colon or rectal cancer).

2. There is enough information to directly or indirectly
construct a 2x2 table(s) of outcome by the marker(s) index
test.

3. The gold standard (reference standard) for the diagnosis of
CRC, colon or rectal cancer is based on clinical
histopathology.

4. Only patients (CRC, colon or rectal cancer) versus
controls (healthy population) are examined.

Auxiliary information such as study design and cut-off values
(see Table S3 of our manuscript) is not very important for the
quantitative synthesis of effect sizes of a diagnostic marker. We
summarized the study designs, which were as follows:
case-control, retrospective case-control, nested
case-control, prospective nested case-control, cohort, prospective
cohort and cohort of consecutive patients (see Table S3 for details).
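
The role of the 2x2 table in these criteria can be illustrated with a minimal sketch. It assumes the table has been reduced to four counts with no zero cells; the names `TwoByTwo` and `accuracy_measures` are hypothetical and are not part of the review's own software.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from one study's 2x2 table (index test vs. reference standard)."""
    tp: int  # marker positive, CRC confirmed by histopathology
    fp: int  # marker positive, healthy control
    fn: int  # marker negative, CRC confirmed by histopathology
    tn: int  # marker negative, healthy control


def accuracy_measures(t: TwoByTwo) -> dict:
    """Derive the usual accuracy measures; assumes no cell is zero."""
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "dor": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Illustrative counts only (not taken from any included study).
example = accuracy_measures(TwoByTwo(tp=45, fp=10, fn=55, tn=90))
print({name: round(value, 3) for name, value in example.items()})
```

Because every measure is a function of the same four counts, a study that reports only sensitivity, specificity and group sizes still meets the 2x2 requirement indirectly.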

For prognostic marker(s), the study must provide time-to-event
data, and the meta-analysis focuses on the hazard ratio (HR) and its
confidence interval (CI):

1. An original paper in English on primary CRC, colon or
rectal cancer had to provide a quantitative result
or give tabulated individual patient data (IPD) [15] to assess
the ability of one or more prognostic serum markers.

2. The study should provide sufficient data to (re)construct a 2x2
table to estimate the marker's prognostic accuracy, or the log of
the hazard ratio (HR) and its precision (the variance or
standard error (SE)), or the HR and its confidence
interval (CI).

In addition to the above two items, the rest of the items are the
same as items 3 and 4 for diagnostic markers.

From papers classified as 'relevant,' information was extracted 
on the tumor marker used, the clinical area of application, the age 
range of patients, stage of disease, whether the outcome was 
overall survival (OS) or disease-free survival (DFS), and the cut-off
level of the marker (see Table S5 of our manuscript for the
details).

Two stages were needed to include or exclude the candidate
articles. A first group of reviewers, who were trained in advance,
assessed the titles and abstracts; a second, independent
group of reviewers, also trained in advance, then assessed the full
articles to ensure that no relevant articles were excluded. Inclusion
or exclusion, as well as data extraction for any paper, was
performed by at least two independent reviewers, and if the
extracted data were not the same, conflicts were resolved by
reaching a consensus. 1) If more than one marker was used in a
given study, the relevant data for each eligible marker were
individually extracted. 2) If one marker had multiple functions (e.g.,
one marker for one disease used for screening, diagnosis,
prognosis and/or monitoring), the datasets corresponding to the
multiple functions were extracted separately. 3) If
multiple markers and diseases were addressed in one study, only the
relevant data from the marker(s) corresponding to each disease of
interest to the author(s) were extracted.

Data extraction 

From papers classified as "relevant," information was extracted 
on the study characteristics, the participant characteristics, the 
type of reference test used to confirm the presence or absence of 
colorectal cancer, the tumor marker used, the clinical area of 
application, the age range of patients, the stage of disease, whether 
the outcome was overall survival (OS) or disease-free survival 
(DFS), and the cut-off levels as well as how these levels were
determined. Some of the studies had several different cut-off levels,
and we only took the one closest to the cut-off corresponding to
95% specificity (avoiding false positives as much as possible) [16].

1) For diagnosis-related papers, the data extraction and methodological
quality assessment of each included study were generally
performed simultaneously. Whiting et al. (2003) proposed a set of
criteria for the Quality Assessment of Diagnostic Accuracy Studies
(QUADAS) that applies well to diagnostic marker studies [17].
Additional information to be extracted included the number of
patients and controls and the numbers of true positives (TP), false
positives (FP), true negatives (TN) and false negatives (FN), which are
mandatory. In addition, the sensitivity and specificity, the 95%
confidence intervals (CIs), the overall accuracy, the positive
predictive value (PPV = TP/(TP+FP)), the negative predictive
value (NPV = TN/(TN+FN)), the positive likelihood ratio (LR+),
the negative likelihood ratio (LR−), and the diagnostic odds ratio
(DOR) of the tumor markers were optional extracted information.
If a study lacked the mandatory information, we calculated the
TP/FP/TN/FN and filled in the blanks in the table.

2) For prognosis-related papers, Altman et al. (2012) proposed reporting
recommendations for tumor marker prognostic studies (REMARK)
[18] that apply well to prognostic marker studies. The
data extraction and conversions for prognostic markers were much
more complex than for diagnostic markers because prognostic
markers provide time-to-event data. Meta-analyses of this type of
marker often require one of two types of data, i.e., the log of the
hazard ratio (HR) and its precision (the variance or standard error
(SE)) or the HR and its confidence interval (CI). For most
prognostic marker studies, the two types of data cannot be
extracted directly. Parmar and colleagues [19] presented a series
of simple methods to extract the relevant data from publications
with the aim of performing a meta-analysis of survival-type data.
The methods focus on approaches for extracting these data from
publications and are illustrated throughout their publication with
real examples. Riley and co-workers (2003) [20] summarized 11
methods (Appendix 3) that are available for directly or indirectly
estimating these data and the approximately normal distribution of
ln(HR) for large samples. In addition, Tierney et al. [21]
provided step-by-step guidance for how to calculate an HR and
the associated statistics for individual trials, according to the
information presented in the trial report. In our study, an R
package was developed based on the methods of Parmar and
colleagues [19] and was applied to indirectly or directly calculate
the HR and its CI.
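
Two of the simplest indirect conversions can be sketched as follows. This is an illustration in Python of the general idea only, not the R package mentioned above; the function names are hypothetical, and the sketch assumes that ln(HR) is approximately normally distributed and that the reported p value is two-sided.

```python
import math
from statistics import NormalDist


def log_hr_from_ci(hr, lower, upper, level=0.95):
    """Return (ln HR, SE) from a reported HR and its confidence interval.

    On the log scale the interval is ln(HR) +/- z * SE, so the SE is the
    interval width divided by 2z.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return math.log(hr), se


def log_hr_from_p(hr, p_two_sided):
    """Return (ln HR, SE) from a reported HR and an exact two-sided p value."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.log(hr), abs(math.log(hr)) / z


# Numerical illustration using the pooled VEGF figures quoted in the abstract.
log_hr, se = log_hr_from_ci(2.245, 1.347, 3.744)
print(round(log_hr, 3), round(se, 3))
```

Studies that report neither a CI nor an exact p value need the more involved reconstructions (for example from survival curves), which this sketch does not attempt.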
Statistical analysis and data synthesis

The systematic review process followed the guidelines published 
by the NHS Centre for Reviews and Dissemination and had an 
overall objective of maintaining breadth, synthesizing the evidence 
qualitatively and then, only where appropriate, using quantitative 
methods [22,23]. 

Diagnostic serum markers 

Meta-analysis of diagnostic test accuracy presents many 
challenges. Even in the simplest case, when the data are 
summarized by a 2x2 table from each study, a statistically 
rigorous analysis requires hierarchical (multilevel) models that 
respect the binomial data structure. In the current study, the forest 
plots of sensitivity and specificity estimates and their 95% CIs were 
constructed from every study using MetaDiSc software (version 
1.4) [24], with the heterogeneity of the accuracy estimates assessed
with the I² statistic [25]. The summary estimates of sensitivity and
specificity were calculated using the package Metandi for STATA
11 statistical software (STATA Corp, College Station, TX) [26]
(Metandi requires Stata 10 or above). We also adopted a
command, metandiplot, to simplify the plotting of graphical 
summaries of the fitted model, namely, the summary receiver 
operating characteristic (SROC) curve and the prediction region 
and also to plot the summary point and its confidence region. 

It has been argued that diagnostic accuracy tests may be
particularly susceptible to publication bias [27]. Simulation studies
have, however, indicated that the effect of publication bias on
meta-analytic estimates of the Diagnostic Odds Ratio (DOR) is not
likely to be large, and its assessment in reviews of test accuracy is
complex [28]. An alternative approach uses funnel plots of the natural
logarithm of the DOR (ln DOR) versus 1/√(effective sample size) and tests for
asymmetry using related regression or rank correlation tests [28].
It should be noted that the power of all statistical tests for funnel
plot asymmetry decreases with increasing heterogeneity of DOR.
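
The funnel-plot coordinates and the asymmetry slope can be sketched as below. The sketch assumes the common formulation in which the effective sample size is 4·n1·n2/(n1+n2) for n1 diseased and n2 non-diseased subjects, the regression of ln DOR on 1/√ESS is weighted by ESS, and 0.5 is added to every cell when any cell is zero. These details and the function names are assumptions of this illustration, not statements about the software used in the review.

```python
import math


def funnel_point(tp, fp, fn, tn):
    """Return (ln DOR, 1/sqrt(ESS), ESS) for one study's 2x2 table."""
    if 0 in (tp, fp, fn, tn):
        # Continuity correction so the odds ratio stays finite (assumption).
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    ln_dor = math.log((tp * tn) / (fp * fn))
    n_dis, n_ctl = tp + fn, fp + tn
    ess = 4 * n_dis * n_ctl / (n_dis + n_ctl)
    return ln_dor, 1 / math.sqrt(ess), ess


def asymmetry_slope(tables):
    """Weighted least-squares slope and intercept of ln DOR on 1/sqrt(ESS)."""
    pts = [funnel_point(*t) for t in tables]
    w = [p[2] for p in pts]
    x = [p[1] for p in pts]
    y = [p[0] for p in pts]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar


# Three illustrative studies with the same DOR but different sizes:
# the slope is zero because accuracy does not depend on study size.
tables = [(45, 10, 55, 90), (90, 20, 110, 180), (450, 100, 550, 900)]
print(asymmetry_slope(tables))
```

A slope far from zero means that small studies report systematically different ln DOR values than large ones; a formal test would also need the standard error of the slope, which is omitted here.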

Prognostic serum markers 

The hazard ratio (HR) was used to measure the impact of the 
expression of individual biomarkers on prognosis. From papers 
classified as 'relevant', information was extracted on the tumor 
marker used, the clinical area of application, the age range of the 
patients, the stage of disease, whether the outcome was overall 
survival (OS) or disease-free survival (DFS), and the cut-off level of 
the marker. The outcome of a marker was recorded as OS, DFS, or
unclear, where available, and separated according to
whether it had been analyzed by univariate or multivariate
analysis. Disease-specific survival (DSS) was included under OS,
and distant disease-free survival (DDFS) and metastasis-free
survival (MFS) were included under DFS. For both OS and
DFS, the following were recorded (where available): whether the
marker for analysis had a significant association with survival, the
hazard ratio (HR), the 95% confidence intervals (CI), the p value
for the factor, whether the p value was exact, and whether the
survival had been analyzed by univariate and/or multivariate
analysis. If multivariate analysis had been performed, other factors
included in the model were also recorded. Because the reported
effect measures varied, we converted the different statistics into the
HR, its 95% CI, and its variance, which were more accurate and
uniform. After obtaining the basic statistics, a sequential process
based on the appropriate command in STATA version 10 (Stata
Corporation, College Station, TX, USA) was implemented to
calculate the pooled HR value. The process followed the work of
Riley et al. [20].



Pooled estimates of the HRs were obtained using both fixed-effect
and random-effect meta-analyses with the inverse-variance
weighting method. Statistical heterogeneity between studies was
assessed using the among-study variance (s²) and the I² statistic
[25]. We conducted heterogeneity χ² tests, and if the assumption
of homogeneity of individual HRs had to be rejected, we used a
random-effect model in place of a fixed-effect model. By
convention, an observed HR>1 implied a worse prognosis for
the group with positive marker expression. We performed a meta-analysis
of prognostic test accuracy using the metan command in
STATA.

Publication bias refers to the phenomenon of studies with
uninteresting or unfavorable results being less likely to be
published than those with more favorable results [29]. If a
publication bias exists, then the published literature is a biased
sample of all studies on a topic, and any meta-analysis based on it
will be similarly biased. Funnel plots are commonly used to
investigate publication and related biases in meta-analyses [30].
The metabias function in STATA performs the Begg and
Mazumdar [31] adjusted rank correlation test for publication bias
as well as the Egger et al. [32] regression asymmetry test for
publication bias. As options, it provides a funnel graph of the data
or the regression asymmetry plot. The Begg adjusted rank
correlation test is the more commonly used of the two for
publication bias analysis, and it was used to assess publication
bias in our study. The "trim and fill" method [33] was
implemented to explore the possible nature of studies "missed"
in the review and to attempt to estimate the "true" relative risk
estimate accounting for publication bias. The command metatrim
in STATA is used to implement the Duval and Tweedie
nonparametric "trim and fill" method.
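
The inverse-variance pooling and heterogeneity statistics described in this section can be sketched as follows. The sketch assumes the DerSimonian-Laird estimator for the among-study variance (often written τ²), which is a common default in random-effect meta-analysis; it is a stand-alone Python illustration with a hypothetical function name, not the STATA metan implementation.

```python
import math


def pool_hazard_ratios(log_hrs, ses, z=1.96):
    """Pool study-level ln(HR) values by inverse-variance weighting."""
    k = len(log_hrs)
    w = [1.0 / se ** 2 for se in ses]  # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_hrs)) / sw

    # Cochran's Q and I^2 describe heterogeneity beyond chance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_hrs))
    df = k - 1
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

    # DerSimonian-Laird estimate of the among-study variance (assumption).
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effect weights add the among-study variance to each study.
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    sw_re = sum(w_re)
    random = sum(wi * yi for wi, yi in zip(w_re, log_hrs)) / sw_re

    def as_hr(estimate, weight_sum):
        se = math.sqrt(1.0 / weight_sum)
        return (math.exp(estimate),
                math.exp(estimate - z * se),
                math.exp(estimate + z * se))

    return {"fixed": as_hr(fixed, sw), "random": as_hr(random, sw_re),
            "Q": q, "I2": i2, "tau2": tau2}


# Two illustrative studies that disagree strongly (made-up inputs).
print(pool_hazard_ratios([0.0, 1.0], [0.1, 0.1]))
```

When the studies agree, the among-study variance is estimated as zero and the two models coincide; when they disagree, the random-effect interval widens, which is why the choice of model matters for the heterogeneous markers reported here.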

Results 

Searching results 

In total, 2243 articles were obtained from the two databases, of 
which 153 articles reporting on 114 CRC serum diagnostic and/or 
prognostic markers (Table S1) were considered as relevant
according to the first two reviewers. A total of 105 papers 
(Appendix 4) were related to diagnosis, whereas 49 (Appendix 5) 
were prognosis papers. Furthermore, 23 of the relevant papers 
include both diagnosis and prognosis. In these studies, a total of 
257 individual tumor markers were obtained. Papers indicating 
related studies in the specific area were studied further to seek 
more relevant results. The process of retrieving and reserving 
papers and the results are shown in Figure 1. 

Tumor Markers Identified Overall and Within Each Clinical 
Area 

Assessment of study quality and investigated diagnostic
serum markers. The quality of diagnosis papers was assessed
using the QUADAS system [11]. The methodological quality of
the studies with a focus on the objective of this review was
generally poor, as shown in Figure 2, with specific details in
Table S2 (references to these studies are prefaced by a 'D' and are
listed in Appendix 4 in Materials S1). Twelve of the studies
used a prospective cohort design; the rest
used case-control designs. Therefore, verification bias inevitably
appeared in those studies. Verification bias arises when
the case group is identified by the gold standard reference test
for a disease or condition, such as cancer, whereas the control
group is presumed to be free of this condition without this being
verified by the gold standard reference test, which inflates
sensitivity and decreases specificity [34-36]. Moreover, most
studies did not have an adequate description of the patient-
[Figure 1 flowchart. PubMed citations (n=1679) and EMBASE citations (n=1329); n=2243 left after exclusion of duplicates. Exclusion: review or meta-analysis; combination marker; not an original article; non-English paper; not primary CRC, colon or rectal cancer; sample size of patients or controls <10; not a serum marker of CRC, colon or rectal cancer; not patients (CRC, colon, rectal cancer) vs healthy population. Remaining studies for detailed review (n=167), classified into diagnosis papers (n=56) and prognosis papers (n=49). Data extraction (see supplementary Tables 2 and 3 for details): individual biomarkers obtained (n=257). Exclusion: sensitivity and specificity for a diagnosis marker, or HR and CI for a prognosis marker, could not be calculated. Serum markers that could be analyzed: diagnosis biomarkers with full information (n=79); prognosis biomarkers with full information (n=80).]

Figure 1. The flowchart of the selection of the relevant articles.

doi:10.1371/journal.pone.0103910.g001



selection procedure, the characteristics of the study participants,
the reference standard, and the cut-off value of the marker used.
The time between the index test (marker) and the reference test as 
well as the availability of other clinical data (as is commonly 
encountered in practice) were also poorly reported. 

Table S3 provides a complete summary of the performance of
all markers across the included studies. In total, 92 serum markers
were identified, and only a few markers were frequently reported.
Of those markers, 73 were reported only once. The
most frequently evaluated serum marker was CEA (42 repetitions)
followed by CA19-9 (24), CRP (9), CA-50 (7), CA72-4 (7), and
VEGF (7) (Table 1). Some reviews may not result in useful
summary estimates of sensitivity and specificity, for example,
because of substantial variability in the individual study estimates
or because the number of the relevant studies corresponding to a
marker is less than three. Several methods of meta-analyzing
diagnostic accuracy data have been proposed, of which two are
statistically rigorous: the hierarchical summary receiver operating
characteristic (HSROC) model [37] and the bivariate model [38].
In the current systematic review, the summaries of the diagnostic
accuracy of those markers, assessed by the HSROC curve [39]
(more than three studies) and by the forest plot of meta-analysis
(more than two studies), are shown in Table 1. CEA is the most frequently
studied biomarker based on the extracted biomarker information.
In total, there are 42 papers presenting the diagnostic results for
CEA. The CEA studies included 8861 individuals, of which 5361
were patients, and the remaining 3500 individuals were controls.
The cut-off value ranged from 2.40 ng/ml to 10.0 ng/ml. The
sensitivity and specificity ranged widely from 25.55% to 97.22%
and 54.40% to 100.00%, respectively.

Figure 3A presents hierarchical summary estimates of sensitivity
and specificity for CEA after back-transformation to ROC
axes. Furthermore, it shows the 95% confidence ellipse around the
mean values of sensitivity and specificity for CEA and a 95%
prediction ellipse for the individual values of sensitivity and


[Figure 2 chart. QUADAS assessment items: representative patient sample; study participants clearly described; selection criteria clearly described; adequate reference standard; acceptable delay between tests; partial verification avoided; differential verification avoided; incorporation avoided; adequate index test description; cut-off value clearly described; adequate reference standard description; blinding for reference test results; blinding for index test results; clinical data available as in practice; uninterpretable test results reported.]

Figure 2. Summary of quality of the included studies, according to the QUADAS criteria (see Table S2 for details).
Figure 3. The ROC and forest plots of summary estimates of sensitivity and specificity of diagnostic marker CEA. A is the ROC plot of
the hierarchical summary estimates of sensitivity and specificity for CEA with 95% confidence and prediction ellipses. B and C are forest plots of
sensitivity and specificity of the diagnostic marker CEA for colorectal cancer plotted with a HSROC model. The size of the squares in B and C is
proportional to the study size and weight for each study. The rhombus represents the pooled estimates, which are 0.461 (CI: 0.448-0.474) and 0.892
(CI: 0.882-0.902) for sensitivity and specificity, respectively.
doi:10.1371/journal.pone.0103910.g003



specificity. The confidence ellipse around the summary (mean) estimate of
sensitivity and specificity marks the region containing the likely
combinations of the mean values of sensitivity and
specificity, and this region is small. The 95% prediction ellipse is wider and
indicates more uncertainty as to where the likely values of
sensitivity and specificity might occur for individual studies.
Figures 3B and 3C separately present the forest plots of the
specificity and sensitivity of the diagnostic marker CEA for
colorectal cancer, with individual study estimates of the sensitivities
and specificities and their 95% CIs under a random-effects model. The
simple summary estimates of the sensitivity and specificity of CEA
for colorectal cancer were 46.1% (95% CI: 44.8-47.4%) and
89.2% (95% CI: 88.2-90.2%), respectively. The HSROC model
produced similar summary estimates of sensitivity and specificity
with comparable CIs (48.5% (95% CI: 44.8-52.3%)
and 91.1% (95% CI: 88-93.0%), respectively) that take into
account the heterogeneity beyond chance between studies
(random-effects model). For the remaining serum markers for
CRC, the pooled sensitivities and specificities with their CIs are
listed in the 6th and 7th columns of Table 1, respectively, but the
HSROC plots and forest plots are presented in Appendix 6 in
Materials S1 because of article length limits. Publication bias
analyses were implemented for the diagnostic markers with more
than three repetitions in studies. The results are shown in the
12th-15th columns of Table 1, and the characteristics of those
markers are listed in Table S3. The corresponding forest plots and
funnel plots are shown in Appendix 7 in Materials S1. The results
indicate that publication bias exists for almost all diagnostic
markers.
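
The narrow intervals around the simple summary estimates for CEA can be reproduced with a short calculation. The sketch assumes that the simple estimate treats all 5361 patients (for sensitivity) and all 3500 controls (for specificity) as a single binomial sample and uses the normal-approximation interval; that reading is an assumption of this illustration, and it ignores between-study heterogeneity.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a pooled proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Pooled CEA estimates and group sizes as reported in the text.
sens_ci = wald_ci(0.461, 5361)
spec_ci = wald_ci(0.892, 3500)
print([round(v, 3) for v in sens_ci], [round(v, 3) for v in spec_ci])
```

Under these assumptions the intervals agree with the reported 44.8-47.4% and 88.2-90.2% to one decimal place, which suggests that the simple estimates reflect sampling error only.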

Assessment of study quality and investigated prognostic
serum markers. The scores of all prognostic studies by
REMARK [18] are shown in Table S4. The scores of these
studies ranged between 16 and 19. Table S5 provides a complete
summary of the performance of all prognostic markers for CRC
across the included studies. In total, 41 serum prognostic markers
were identified, and only a few markers were frequently reported.
Of those markers, 22 markers were only reported one time, 13
markers were reported twice, and only 10 markers were reported
more than three times. The most frequently evaluated serum
prognostic marker was CEA (34 repetitions) followed by CA19-9
(10), VEGF (9), MASP-2 (6), CRP (5), TIMP-1 (4), YKL-40 (3),
MMP-7 (3), PAI-1 (3), and suPAR (3). The prognostic markers
with more than three repetitions were chosen for the meta-analysis
and publication bias analysis using STATA (version 10) software,
and the summaries are given for each marker in Table 2.

The most frequently reported prognostic marker for CRC is
CEA. The CEA studies included 5792 patients, of which 3856
had positive results for the CEA marker, whereas 1936
were negative. The cut-off values ranged from 2.7 ng/ml
to 10.0 ng/ml. The median patient age across all trials was
between 47.74 and 73 years, with an age range of 31-90 years.
All patients had histologically or cytologically confirmed CRC,
colon or rectal cancer as the primary diagnosis. There are 28
articles related to CEA and the prognosis outcome of the patients,
of which 6 studied both overall survival (OS) and
disease-free survival (DFS). There are 9 articles that do not state
whether they studied OS or DFS; we defined these as
"unclear" (Table 2). A summary of the individual trials and overall
pooled results from the primary analysis of the overall survival is
shown in Figure 4. According to the outcomes (OS, DFS and
unclear), the CEA data were classified into three subgroups, and the
three subgroup datasets were separately submitted to the meta-analysis
and publication bias analysis. As a result, the pooled HRs
with 95% CIs of the OS, DFS, and unclear subgroups were 1.624
(1.290-2.043), 1.453 (1.267-1.666), and 2.208 (1.479-3.297),
respectively, and the overall HR (CI) from the three combined
subgroups was 1.513 (1.391-1.645) (Figure 4A). After analysis of
the publication bias by the "trim and fill" method, three, seven, and one
"missing" studies were added to the OS, DFS, and unclear
subgroups, respectively (Figure 4B, C and D and Table 2).
The adjusted HRs with the 95% CIs for the three subgroups were
1.346 (1.083-1.671), 1.166 (1.018-1.336) and 2.073 (1.410-3.047),
respectively. All adjusted HRs were smaller
than the unadjusted HRs (Table 2, panel CEA). The
same methods of meta-analysis and publication bias analysis were
implemented for the remaining prognostic markers with more
than three repetitions in studies on CRC. The results are shown in
Table 2, and the characteristics of those markers are listed in Table
S5. The corresponding forest plots and funnel plots are shown in
Appendix 8 in Materials S1.

Discussion 

Appraisal of the Systematic Review 

In our study, we performed a systematic review and meta-analysis
for all of the published CRC serum biomarkers. Through
the investigation, we identified 114 serum biomarkers (92 for
diagnosis, 41 for prognosis), of which 20 biomarkers can act as both
diagnosis and prognosis markers. Most of the markers have been
published only once, and the three most frequently reported
markers for diagnosis are CEA (42 studies), CA19-9 (25 studies),
and CA242 (10 studies), and for prognosis, they are CEA (34
studies), CA19-9 (10 studies), and VEGF (9 studies). For the
diagnosis markers that were studied more than twice, we used the
HSROC model and meta-analysis approach for the sensitivity and
specificity correlation analysis. The results suggested that almost all
of the pooled sensitivities of the diagnosis markers were less than
50% and were accompanied by significant heterogeneity. Publication bias
exists for most diagnostic serum CRC markers according to an alternative
approach using funnel plots of the natural logarithm of the DOR (ln DOR) versus
1/√(effective sample size) [28]. Likewise, meta-analyses and
publication bias analysis were implemented for the prognostic
markers with more than three repetitions in studies. The
pooled HRs range from 1 to 2, which suggests that survival differs
little between the marker-positive and marker-negative
patients. Our analysis may explain why the
reported diagnostic and prognostic markers of CRC are not
suitable for clinical applications: most of the pooled
sensitivities of the diagnosis markers were less than 50%, the
heterogeneity was significant, and the pooled HRs of the prognosis
markers were greater than 1 and less than 2.

The ideal study sample for a test accuracy study is a consecutive 
or randomly selected series of patients in whom the target 
condition is suspected, or for screening studies, the target 
population. There are two basic types of test accuracy studies: 
cohort studies and case-control studies. Both diagnostic and 
prognostic studies included in the current systematic review 
predominantly belong to the case-control design type, which is 
liable to bias [40]. Diagnostic or prognostic tests perform 
Figure 4. Meta-analysis plots of the progression-free and overall survival hazard ratios in individual trials. A is the forest plot and B, C,
and D are the "filled" funnel plots of OS, DFS, and the unclear group, respectively. The meta-analysis displayed a significant effect in favor of a high 
volume. The pooled and filled results are presented in Table 2. 
differently in different populations [41,42]. It is important to
clearly define the population of interest. In our systematic review, 
the study population is limited to primary CRC. 

Analysis of potential reasons for publication and 
heterogeneity observed 

A potential source of bias (i.e., publication bias) is whether all
relevant studies have been identified; a small number of partly
published studies may have been omitted. Tables 1 and
2 indicate publication bias in both the diagnostic and prognostic
studies. Although we included as many key
words and relevant words in our initial search strategy as possible,
we acknowledge the possibility that this review was not exhaustive,
reflecting publication and reporting bias. The reasons for
"missing" papers may include the following: (1) the key words
and relevant words related to CRC may not be fully comprehensive;
(2) we did not search all of the literature databases (only
EMBASE and PubMed were searched, but we believe these two
databases include the majority of candidate papers); (3) we did not
include non-English language papers because of the difficulties in
translation, and this may have introduced bias if statistically or
clinically significant studies were more likely to be (re)written for
publication in an English language journal [43]; (4) a few articles
were found in the two databases but could not be
downloaded, in part because they were published too long ago
or because the journal that published them is obscure; (5)
some papers did not provide a complete report of the data in the
original article. Despite these concerns, the papers included in our
study account for the vast majority of all papers relevant to CRC,
and we believe that the final results are representative.

Another potential source of bias specific to this study is that of 
overlapping datasets. We minimized this bias by excluding 
overlapping datasets and retaining only the most recent study. 

Heterogeneity between studies may represent a further potential 
source of bias, and it is indispensable for any meta-analysis that 
potential sources of heterogeneity are examined. Variability 
beyond chance can be attributed to between-study differences in 
the selected cut-point for positivity, in patient selection (such as 
severity of illness, age and gender) and clinical setting (such as 
dose, timing or duration of treatment), in the type of test used, in 
real variation in the treatment effect, in the type of reference 
standard, or any combination of these factors. In addition, 
heterogeneity in study results can also be caused by flaws in study 
design [44]. In reviews of studies on the prognostic accuracy of 
tests, heterogeneity may be influenced by duration of follow-up or 
the reliability of outcome measures [45]. To overcome the 
problem of heterogeneity, we suggest improving study design 
standards and designing large prospective studies to 
answer pre-specified questions of clinical interest. Weaknesses in 
the reporting, analysis and presentation of results were frequently 
apparent throughout the evaluation of the selected papers. The 
presentation of survival analyses was particularly poor, and the HR 
and its CI were often not reported directly. Accordingly, better 
reporting should be promoted. Large prospective multi-center 
studies should be conducted, and multi-disciplinary teams should 
collaborate to seek consistency in cut-offs, adjustment factors, 
outcomes, analysis, measurement methods and other relevant 
variables. 
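
To make the notion of "variability beyond chance" concrete, the sketch below computes Cochran's Q and the I² statistic for a set of study-level effect sizes, using inverse-variance weights. These are standard heterogeneity measures chosen here for illustration; the text above does not state which statistics were used in this review, and the ln HR values and variances in the example are hypothetical.

```python
import math

def heterogeneity(estimates, variances):
    """Return (fixed-effect pooled estimate, Cochran's Q, I^2 in percent)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of total variation attributed to between-study differences.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Hypothetical ln HR values and variances; not data from the review.
ln_hrs = [math.log(1.2), math.log(2.5), math.log(1.1)]
variances = [0.02, 0.05, 0.01]
pooled_ln_hr, q_stat, i_squared = heterogeneity(ln_hrs, variances)
pooled_hr = math.exp(pooled_ln_hr)
```

An I² near 0% suggests that the spread of the estimates is compatible with chance, whereas values well above 50% are commonly read as substantial heterogeneity.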

Interpretation of the diagnostic serum markers 

For the diagnostic markers of CRC, various aspects, such as the 
diverse populations used (different age, origin, "normal," or 
diseased controls), the diverse number of markers evaluated (single 
versus combined markers), and the use of different cut-off points 
for the same marker, result in an order of magnitude range of 
sensitivities and specificities reported for the various markers. 
Moreover, the majority of the markers (73/91, 80.2%) were 
evaluated in only one study (Table S3). Interpretation of many 
studies is further limited by the selection of cases and controls 
because case-control studies may overestimate the sensitivity 
and specificity [46-48]. In case-control studies, the case group of 
patients may span a wide range of pathological grades, ages, 
genders, regions and ethnicities. The controls, on the other hand, 
had often not undergone colonoscopy. 
These control groups most likely included a substantial proportion 
of adenoma carriers because the prevalence of adenomas among 
older adults is estimated to be approximately 20% to 30% [49-51]. 
In CRC marker studies, the patient group should be 
compared with multiple control groups, such as patients with other 
types of cancer or other intestinal diseases, advanced adenoma 
cases and a normal healthy population. Without these comparisons, 
the marker cannot be exactly correlated to CRC, and the specificity 
may be inaccurately estimated. In addition, the 
estimated value of a new CRC serum marker is not reliable 
because of the lack of double-blind randomized clinical trials. 
Another concern is the comparability of results across 
studies given the potential differences in serum collection, 
processing, and storage methods, and uncertainties in the stability 
of several biomarkers. Information on these issues is very limited. 
All of the above-mentioned factors may cause variation in the 
results for markers of CRC, leading to imprecise pooled results in 
the meta-analysis. 

Interpretation of the prognostic serum markers 

Prognostic research has, to date, received much less attention 
than research into therapeutic or diagnostic areas, and an 
evidence-based approach to the design, conduct and reporting of 
primary studies of prognostic markers is needed [52]. Reviews 
have demonstrated that primary prognostic studies are often of 
poor quality [53]. Furthermore, synthesis of prognostic studies is a 
relatively new and evolving area in which the methods are less well 
developed than for reviews of therapeutic interventions or of 
diagnostic accuracy and available reviews have often been of poor 
quality [54-57]. For prognostic markers, apart from the duration 
of follow-up, the aspects leading to the heterogeneity observed 
are largely similar to those for diagnostic markers. Throughout the 
evaluation of the 49 selected papers, weaknesses in the analysis, 
reporting, and presentation of the results were frequently 
apparent. The poorly presented survival analyses emphasize the 
problems addressed in the recommendations by Altman and 
colleagues [58]. For example, to conduct the meta-analyses, we 
made 120 attempts to obtain estimates of the HR and its CI from 
the data/results provided, but only 79 of these proved successful. 
The remaining 41 were indirectly calculated using the raw 
individual patient data available or the survival curve plot. The 
HR and its CI (or ln(HR) and its variance) provide an important 
estimate of the difference in the risk of death (for OS) or disease 
recurrence/death (for DFS) between two groups of patients, but 
the result is often reported only as an inexact p value. 
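
When a paper does report the HR with its CI, the log hazard ratio and its variance needed for pooling follow by simple arithmetic. The sketch below shows that conversion, assuming a 95% CI that is symmetric on the log scale; the function name is hypothetical, and the input is the pooled VEGF estimate quoted in the abstract (HR 2.245, CI 1.347-3.744), used purely as an arithmetic example.

```python
import math

def ln_hr_from_ci(hr, lower, upper, z=1.96):
    """Return (ln HR, variance of ln HR) from an HR and its confidence limits.

    Assumes the interval is symmetric on the log scale and that z is the
    normal quantile matching the CI level (1.96 for a 95% CI).
    """
    ln_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return ln_hr, se ** 2

# Values quoted in the abstract for VEGF; used only to show the arithmetic.
ln_hr, var_ln_hr = ln_hr_from_ci(2.245, 1.347, 3.744)
```

A quick plausibility check is that the midpoint of the log limits should lie close to ln HR; a large gap suggests a misprinted interval or a CI level other than 95%.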

The indirect methods suggested by Parmar and colleagues [19] 
were found to be particularly crucial. To make maximal use of the 
raw data, 18 arguments (see Materials and Methods for the details) 
were extracted from each article to indirectly calculate the lnHR 
and varlnHR. In some articles, the authors reported neither the 
individual patient data (IPD) nor the 18 arguments. However, 
survival curve plot(s) were presented, and an R package was 
developed to extract the data from these plots and indirectly obtain 
the lnHR and varlnHR. This approach represents an innovative 
extension of the 11 methods summarized by Riley and co-workers 
[20] (Appendix 3). 
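
One commonly used indirect route of this kind starts from a reported log-rank p value and the total number of events. The following sketch illustrates that single route only; it is not the authors' R package and does not cover the 18 arguments or the curve-based extraction. It assumes a two-sided p value, approximates the log-rank variance as V = O·n1·n2/(n1 + n2)², and relies on the reader to supply the direction of the effect; all names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

def ln_hr_from_logrank(p_value, total_events, n_group1, n_group2,
                       group1_worse=True):
    """Approximate (ln HR, variance of ln HR) from a two-sided log-rank p value."""
    # Normal quantile corresponding to the two-sided p value.
    z = NormalDist().inv_cdf(1 - p_value / 2)
    # Approximate log-rank variance from events and group sizes (assumption).
    v = total_events * n_group1 * n_group2 / (n_group1 + n_group2) ** 2
    # Observed minus expected events; the sign is taken from the paper's text.
    o_minus_e = z * math.sqrt(v)
    if not group1_worse:
        o_minus_e = -o_minus_e
    return o_minus_e / v, 1 / v

# Hypothetical study: p = 0.05, 100 events, 50 patients per group.
ln_hr, var_ln_hr = ln_hr_from_logrank(0.05, 100, 50, 50)
hr = math.exp(ln_hr)
```

Because an inexact p value such as "p < 0.05" gives only a bound on z, estimates obtained this way are conservative approximations and are best flagged as indirect, as in the data acquisition column of the supplementary tables.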

Clinical validities of CEA and CA19-9 

We specifically investigated the clinical use of the two most 
studied markers, CEA and CA19-9, which are both 
diagnostic and prognostic markers for CRC and have significant 
heterogeneity and asymmetry. For CEA, a lack of sensitivity and 
specificity, when combined with the low prevalence of CRC in 
asymptomatic populations, precludes the use of CEA in screening 
for CRC [59-61]. In agreement with American Society of Clinical 
Oncology (ASCO) [62,63] and European Group on Tumor 
Markers (EGTM) recommendations [64,65], the National 
Academy of Clinical Biochemistry (NACB) Panel states that CEA 
cannot be used to screen healthy subjects for early CRC. The 
patient stage at initial diagnosis is universally used to determine 
prognosis in patients with CRC. Several studies, however, have 
demonstrated that preoperative concentrations of CEA can also 
provide prognostic information which, in some situations, has been 
found to be independent of stage [59-61,66]. Indeed, in some 
studies, CEA was found to be prognostic in patients with Stage II 
disease [59-61]. Preoperative concentrations of CEA might thus 
be combined with other factors to identify those Stage II colonic 
cancer patients who are candidates for adjuvant chemotherapy. 
There is, however, no evidence at present for a beneficial effect of 
adjuvant chemotherapy in either Stage II patients, as a whole, or 
in those with Stage II disease and high preoperative serum CEA 
concentrations. In agreement with other expert panels [62-65], 
the NACB Panel states that preoperative CEA levels should be 
measured in newly diagnosed CRC patients. CEA levels may be 
combined with histopathological parameters to determine which 
patients with Stage II colon cancer should receive adjuvant 
chemotherapy. However, as mentioned above, there is currently 
no evidence that Stage II colon cancer patients with elevated 
concentrations benefit from adjuvant chemotherapy. 

The CA 19-9 assay detects a mucin containing the sialylated 
Lewis-a pentasaccharide epitope, fucopentaose II [67]. CA 19-9 is 
a less sensitive marker than CEA for CRC [68,69]. Preliminary 
findings suggest that, like CEA, preoperative concentrations of 
CA 19-9 are also prognostic in patients with CRC [70-75]. Based 
on available data, routine measurement of CA 19-9 as either a 
diagnostic or a prognostic marker is not recommended by either 
ASCO [76] or the EGTM [77] for patients with CRC. 
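
The screening argument made above for CEA, that modest sensitivity and specificity combined with low prevalence rule out screening use, can be illustrated with a positive predictive value calculation based on Bayes' theorem. The sensitivity, specificity and prevalence in the sketch are hypothetical round numbers chosen for the example, not estimates from this review or from the cited guidelines.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical inputs: 45% sensitivity, 90% specificity, 0.5% prevalence.
ppv = positive_predictive_value(0.45, 0.90, 0.005)
```

Under these assumed inputs only about 2 in 100 positive results would correspond to a true cancer, which shows why a marker of this kind is of little use in an asymptomatic population.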

Conclusions 

Our systematic review summarizes the evidence about the 
accuracy of serum diagnostic and prognostic tests for colorectal 
cancer (CRC). However, the majority of these markers have only 
been reported in a single study (diagnostic markers: 73 of 92, 
79.3%; prognostic markers: 23 of 44, 52%). The cut-offs of the 
markers investigated in more than three studies vary considerably, 
and the effect sizes of the same marker in different 
studies generally demonstrate significant heterogeneity. The 
quality of studies addressing the diagnostic and prognostic 
accuracy of tests was poor, and the results were highly 
heterogeneous. Thus, like many reviewers of such studies, the 
present authors do not feel that the existing literature is strong 
enough to form a basis for clinical decisions. We believe, however, 
that the current systematic review can highlight underlying 
problems in research on CRC serum markers and help to improve 
future studies, for example, by exploring novel markers or 
constructing a "combination" marker composed of a few highly 
weighted markers that meets the requirements of clinical use. 

Supporting information 

Table S1 List of serum or plasma markers in CRC that 
were identified by the systematic review together with 
the number of papers overall and within each clinical 
area. 

(XLS) 

Table S2 Study characteristics and quality of the 
included studies. See Whiting et al. [17] for criteria on quality 
assessment. Items were scored 1 = yes, 2 = no, 3 = unclear. The 
reference IDs in the 2nd column are prefaced by a 'D' and listed in 
Appendix 4 in Materials S1. 
(XLS) 

Table S3 The complete summary of the performance of 
all serum diagnostic markers for colorectal cancer. Note: 
reference IDs to these studies are prefaced by a 'D' and listed in 
Appendix 4 in Materials S1. In the Data acquisition (direct/indirect) 
column, 'd' means that the four core values, True Positive 
(TP), False Positive (FP), True Negative (TN) and False Negative 
(FN), of one diagnostic marker can be directly extracted from the 
study; otherwise, 'i' means that the four values were indirectly 
extracted from the study and extrapolated from other relevant 
values. N/A means the value is not available. 
OR = (TP/FN)/(FP/TN), 
varlnOR = 1/TP + 1/FP + 1/TN + 1/FN. 
(XLS) 

Table S4 Study characteristics and quality of included 
prognosis papers. Notes: An assessment of study methodology 
was performed according to the REMARK study design criteria [18], 
which include 20 items. For any criterion not fulfilled according to 
the REMARK requirement, one point was deducted from a 
maximum of 20. Two independent investigators assessed 
the eligibility criteria and quality scoring. Any disagreement was 
resolved by discussion. The scores of these studies ranged between 
16 and 19. 
(XLS) 

Table S5 Studies investigating the prognostic serum 
markers of colorectal cancer. Note: The reference IDs to 
these studies are prefaced by a 'P' and listed in Appendix 5 in 
Materials S1. No of Patients (+) or (-) means the number of 
patients with positive or negative serological test results, defined by 
the level of the colorectal cancer marker. OS: overall survival; 
DFS: disease-free survival. In the Data acquisition (direct/indirect) 
column, 'd' means that the three core values, Hazard Ratio (HR), 
Lower Limit (LL), and Upper Limit (UL), of one prognostic 
marker can be directly extracted from the study; otherwise, 'i' 
means that the three values were indirectly extracted from the 
study and extrapolated from other relevant values. 
(XLS) 